{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "327d1de5",
   "metadata": {},
   "source": [
    "# Deploy Llama3 with TensorRT-LLM\n",
    "\n",
    "Welcome!\n",
    "\n",
    "In this notebook, we will walk through on converting Mistral into the TensorRT format. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.\n",
    "\n",
    "Once the TensorRT engine is build, you can use the run.py script provided at the end of this notebook or use this engine as in input to the Triton Inference Server. \n",
    "\n",
    "See the [Github repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0b67443",
   "metadata": {},
   "source": [
    "### Step 1 - Install TensorRT-LLM\n",
    "\n",
    "We first install TensorRT-LLM. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e39eb57b-2e6a-448c-b0ac-f9c065e486ac",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "707bc778-169c-4d3d-8836-2d1afdce48f6",
   "metadata": {},
   "source": [
    "### Step 2 - Download Llama3 model weights\n",
    "\n",
    "Llama3 is a gated model which means you'll need to request approval on their respository and generate a HF token. This usually takes about 20 minutes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "220e7c94-6acf-4562-bb8f-c2b6bbb52dcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import huggingface_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa7b31d9-d6cf-40ec-8132-569b498ba34d",
   "metadata": {},
   "outputs": [],
   "source": [
    "huggingface_hub.login(\"<ENTER TOKEN HERE>\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d824340-b183-4cea-bc95-d2d7592ce349",
   "metadata": {},
   "outputs": [],
   "source": [
    "huggingface_hub.snapshot_download(\"meta-llama/Meta-Llama-3-8B-Instruct\", local_dir=\"llama3-hf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4333ef85-a14e-485d-aea0-2f7f055f3dc3",
   "metadata": {},
   "source": [
    "### Step 3 - Convert checkpoints into safetensors and build the TRT engine\n",
    "\n",
    "There are 2 substeps here. The first is converting the raw huggingface model into safetensors which is a safe and fast format for storing tensors. \n",
    "\n",
    "Next we build the TensorRT engine. This is where the magic happens. We take the converted safetensors model and convert it into a `TensorRT engine`. Engines are optimized versions of models built to run lightening fast on the current machine.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94f5d76b-87a1-4275-9205-ec811676a814",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -L https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e08e8d92-c303-4cff-b33e-da9e51eac3f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python convert_checkpoint.py --model_dir llama3-hf \\\n",
    "    --output_dir ./llama3-safetensors \\\n",
    "    --dtype bfloat16"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a6fe149-a330-4d8b-a2c7-af915c8cb1b9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!trtllm-build --checkpoint_dir llama3-safetensors \\\n",
    "    --output_dir ./llama3engine_bf16_1gpu \\\n",
    "    --gpt_attention_plugin bfloat16 \\\n",
    "    --gemm_plugin bfloat16"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a582f71d-5d83-4599-a78a-fc160ef9b47e",
   "metadata": {},
   "source": [
    "### Step 4 - Run the model using the example script!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5b8fb1d-addf-48bb-aac3-bd09cc1ffa9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/NVIDIA/TensorRT-LLM.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "216c03ec-aca1-40fe-8532-3d903314cbb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./TensorRT-LLM/examples/run.py --engine_dir=llama3engine_bf16_1gpu \\\n",
    "    --max_output_len 100 \\\n",
    "    --tokenizer_dir llama3-hf \\\n",
    "    --input_text \"How do I count to nine in French?\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
